ABSTRACT


Question answering forums in online learning environments 
provide a valuable opportunity to gain insights as to what 
students are asking. Understanding frequently asked ques- 
tions and topics on which questions are asked can help in- 
structors in focusing on specific areas in the course content 
and correct students’ confusions or misconceptions. An un- 
derlying task in inferring frequently asked questions is to 
identify similar questions based on their content. In this 
work, we use hierarchical agglomerative clustering that ex- 
ploits similarities between words and their distributed rep- 
resentations, reflecting both lexical and semantic similarity 
of questions. We empirically evaluate our results on a
real-world labeled dataset to demonstrate the effectiveness of
the method. In addition, we report the results of inferring
frequently asked questions from the discussion forums of an
online learning environment providing lectures to middle
school and high school students.


Keywords 
frequently asked questions, agglomerative clustering, ques- 
tion similarity, community question answering. 


1. INTRODUCTION 


Self-paced online learning environments provide valuable learning resources to a large number of students. A primary mechanism of interaction between students is the discussion forum. These forums enable students to ask questions, answer questions and learn collaboratively. Question answering forums are discussion forums where every thread is a question posted by a student, much like community question answering (CQA) platforms such as StackOverflow¹ and Quora². Over time, a large number of students may post similar questions, which could indicate topics susceptible to confusion or misconceptions, or course content requiring further explanation. Most question answering forums allow a student or user to search for similar questions in the archives using information retrieval techniques. While searching for similar questions is useful for a student, it gives an instructor only a limited view of frequently asked questions. A potential way to aid manual identification of common or frequently asked questions in such forums is to employ clustering, so that semantically related questions are grouped together.

Proceedings of the 10th International Conference on Educational Data Mining


Motivating Example: Table 1 lists examples of sample groups of similar questions posed by middle and high school students on Khan Academy³. These groupings, or question clusters, can help an instructor identify key concerns or confusions among students. The instructor could address confusions by providing additional content on the specific topic. For example, many students ask questions about the slope of a vertical or horizontal line. Having a view of question clusters can be valuable to the instructor and help in refining course content.


Partition-based clustering methods such as k-means, k-medoids and k-means++ [9] need prior information about the number of clusters required. Providing the number of clusters as input can be very hard for instructors. Hence, in this work we use hierarchical clustering [9], which does not have this input requirement. Dendrograms (trees of clusters), which capture the results of hierarchical clustering, allow instructors to extract clusters of different granularities without having to re-run the clustering algorithm. Further, most hierarchical clustering algorithms provide the flexibility to choose a distance metric, which we utilize in this work.


Existing work on processing CQA archives identifies or ranks similar questions given a new question [12]. While the problem of estimating the relevance of questions to a new question is related to estimating similarities between questions to identify clusters, much of the work done to address the former problem uses supervised learning approaches that require labeled datasets for training and building models.


Table 1: Examples of frequently asked questions.

C# | Video Lecture | Student Questions
C1 | Graphing a line in slope intercept form | What would the line look like if the slope was a zero?
   |  | What is the slope of a horizontal line?
   |  | what about vertical lines? do they have slope?
   |  | Would a vertical line imply an undefined slope, and would a horizontal line imply a zero slope?
C2 | Proof of Limit sin(x)/x | Why not use L’hopital’s rule?
   |  | can you use lhopital ’s rule to prove this limit ?
   |  | you can also use lhopital ’s rule to turn sinx/x turn into into cosx/1
   |  | Can you also prove this limit using L’Hopital’s rule?
   |  | Just use hopital’s rule for that ...sin(x)/x == cos(x)/1 and cos(x) for x->0 = 1


¹ www.stackoverflow.com
² www.quora.com
³ www.khanacademy.org


Our Contributions: We address the problem of inferring frequently asked questions (FAQ) by harnessing a distance metric that combines the similarity of the words in the questions, computed using a lexical database (such as WordNet⁴), with a word embedding space representation that captures the contextual similarity of words. We further provide a flexible way of cutting the dendrogram output by the clustering algorithm, allowing the end user to identify clusters of questions. A range specifying the number of points needed to define a cluster is taken as input. The generated clusters are sorted by the distance metric, thus enabling instructors to filter and identify relevant question clusters.


2. RELATED WORK 


In this section we position our work in the context of existing 
literature along two directions: (1) Analyzing textual con- 
tent available in student discussion forums, (2) Processing 
questions in community-based question answering (CQA) 
systems. 


2.1 Student Discussion Forums 
There has been a growing body of research on analyzing the textual discussion forum data in Massive Open Online Courses (MOOCs).


A precursor to analyzing questions is determining the utterance of students or classifying the dialog acts of the students (such as asking questions, giving feedback, or agreeing and disagreeing). Ezen-Can et al. [4] apply the k-medoids clustering algorithm and qualitatively evaluate the clusters to group dialog acts and topics. In our work, we analyze posts that are categorized as questions. Topic analysis of MOOC discussion content using the Structural Topic Model (STM) has been explored by Reich et al. [15]. While topic labels are useful in providing a broad overview of the themes that are attracting student discussions, they do not help the instructor in analyzing finer details of what students are asking or answering. In a recent work, Thushari et al. [2] present a ‘topic-wise organization’ of discussion posts by using Latent Dirichlet Allocation (LDA) on the discussion data. The authors present a topic visualization dashboard that would assist MOOC staff in understanding emergent discussion themes or identifying popular topics [1]. Our work uses questions in the student question answering forums and evaluates the semantic similarity between pairs of questions to identify similar question clusters. The work presented here can be used on the subset of discussion posts that have been tagged or organized into a topic.


⁴ https://wordnet.princeton.edu/


In addition, discussion forum data has been utilized for a wide variety of purposes. Recent among these is the analysis of the information seeking behavior of students (which includes querying, refining the query, reading and browsing) while they learn programming [8]. Sentiment analysis in discussion forums [18] and studies examining the relationship between students’ discussion behaviors and their learning [17] [6] explore various possibilities of using the forum as a rich source of data.


2.2 Community Question Answering (CQA) 
The popularity of CQA platforms indicates that users find them useful in finding answers to their questions. However, there are several issues related to CQA that have led to a large body of research: 1) Identifying good and relevant answers to questions can help users filter noise in the responses. 2) Identifying questions that may be repeated or closely related to previously asked questions can help eliminate redundancy. The latter issue relates very closely to the problem we address in our work.


One of the recent tasks in SemEval 2016 [12] dealt with identifying and ranking a set of 10 related questions given a new question. The participating teams built supervised machine learning models that used distributed representations of words and knowledge graphs to define lexical and semantic features [5], as well as neural network approaches including convolutional neural networks (CNN) or long short-term memory (LSTM) networks [11], [16], [13]. The focus of their work is to rank the questions by relevance, considering semantic similarity. A prerequisite to using these approaches in practice is a labeled dataset. In our work, we use an unsupervised method that circumvents the need for labeled data.


Clustering question-answer (QA) pairs from CQA systems to ease tasks such as tagging has been less explored. In one of the recent works [14], the authors identify clusters of related QA. The approach is based on the classical k-means clustering algorithm, but mixes the similarities of the questions and answers to define an objective function that is optimized over multiple iterations. While our goal is also to cluster questions using an unsupervised model, we do not rely on the answer information, primarily because the answers given by peer students may contain irrelevant information, especially with students from middle school.


[Figure 1 (flow diagram): questions -> Preprocessing (spell checking, lemmatization, stopword removal) -> Question-question distance metric (lexical similarity using dictionary, word embedding similarity) -> Clustering (single linkage clustering, complete linkage clustering) -> Dendrogram, with # of questions per cluster [range] as input.]

Figure 1: Identifying commonly asked questions.


3. IDENTIFYING COMMON QUESTIONS 


Our method to infer or identify commonly asked questions 
is organized into multiple steps, as shown in Figure 1. The 
first step deals with preprocessing the question to remove 
any noise. Next, we focus on the key aspect of any clustering
algorithm: the choice of a (dis)similarity function or distance
metric between a question pair. The hierarchical clustering
algorithm uses the distance metric to derive the output as 
a dendrogram. Finally, the dendrogram is partitioned and 
the clusters are identified. 


3.1 Preprocessing 

In the preprocessing phase, for each question we filter out all URLs, email addresses and other similar patterns that may be irrelevant in the context of the data being analyzed. Misspellings are corrected using the WordNet database. Stopwords are removed and the remaining words in each question are lemmatized to their base forms using the lemmatizer provided by the Stanford CoreNLP parser⁵.
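
The filtering and stopword steps described above can be sketched as follows. This is a minimal illustration only: the regular expressions and the tiny `STOPWORDS` set are assumptions made for the example, and the spell correction and lemmatization steps (WordNet and Stanford CoreNLP in the text) are deliberately left out.

```python
import re

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "of", "what", "do", "they", "in", "to"}

# Illustrative patterns (assumptions, not the patterns used in the paper).
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")
URL_RE = re.compile(r"(https?://\S+|www\.\S+)", re.IGNORECASE)
TOKEN_RE = re.compile(r"[a-z0-9']+")


def preprocess(question: str) -> list[str]:
    """Strip email addresses and URLs, lowercase, tokenize, drop stopwords."""
    text = EMAIL_RE.sub(" ", question)
    text = URL_RE.sub(" ", text)
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in STOPWORDS]


if __name__ == "__main__":
    print(preprocess("What is the slope of a horizontal line? See www.example.com"))
```

A lemmatizer and spell checker would be applied to the returned tokens before any distance is computed.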


3.2 Question-Question Distance Metric 

The distance function uses a combination of both the lexical and word embedding similarity. We define the distance metric between a question pair $q_i, q_j$ as follows:


$$\mathrm{dist}(q_i, q_j) = \left( (\Omega \cdot D_{bow}(q_i, q_j))^{x} + ((1 - \Omega) \cdot D_{vec}(q_i, q_j))^{x} \right)^{1/x} \quad (1)$$


where $D_{bow}(q_i, q_j)$ is the distance computed based on the lexical similarity and $D_{vec}(q_i, q_j)$ is the distance computed based on word embeddings for the question pair $(q_i, q_j)$. The following sections describe the distance metrics in detail. The parameter $\Omega$ is the weight associated with the lexical distance, with $(1 - \Omega)$ weighting the word embedding based distance. As stated by the authors in [14], a metric of the form $(a^x + b^x)^{1/x}$ approximates $\max\{a, b\}$ for high positive values of $x$ and $\min\{a, b\}$ for high negative values of $x$.
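
A small sketch of how the combined metric behaves, assuming the two component distances are already available as numbers in [0, 1]. The function name `combined_distance` and the handling of zero terms for negative exponents are illustrative choices, not part of the original method description.

```python
def combined_distance(d_bow: float, d_vec: float,
                      omega: float = 0.5, x: float = 1.0) -> float:
    """Combine the weighted lexical and embedding distances with exponent x."""
    if x == 0:
        raise ValueError("x must be non-zero")
    a = omega * d_bow
    b = (1.0 - omega) * d_vec
    if x < 0 and (a == 0 or b == 0):
        # Assumption: use the limiting value (the minimum, 0) instead of dividing by zero.
        return 0.0
    return (a ** x + b ** x) ** (1.0 / x)


if __name__ == "__main__":
    for exponent in (1, 50, -50):
        print(exponent, combined_distance(0.4, 0.8, omega=0.5, x=exponent))
```

With a large positive exponent the result approaches the larger weighted term, and with a large negative exponent it approaches the smaller one, which is the max/min behaviour noted above.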


⁵ http://stanfordnlp.github.io/CoreNLP/
3.2.1 Lexical Similarity

Each question is represented as a bag-of-words vector whose dimension is the vocabulary size of the question corpus W. Each word $w_i$ in the question and its associated synonyms are identified from the WordNet lexical database. The words are weighted by their idf measure. The idf measure is given by


$$\mathrm{idf}(w_i) = \log\left(\frac{|D|}{\mathrm{df}(w_i)}\right) \quad (2)$$

where $|D|$ is the corpus size and $\mathrm{df}(w_i)$ is the number of documents containing $w_i$. The similarity between two questions, $Sim_{bow}(q_i, q_j)$, is computed using the cosine similarity of the question vectors. The distance is defined as:


$$D_{bow}(q_i, q_j) = 1 - Sim_{bow}(q_i, q_j) \quad (3)$$
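
The lexical distance can be illustrated with a short sketch. It assumes questions are already tokenized, weights each present word by its idf (binary presence, which is an assumption; the text does not say whether term counts are used), and omits the WordNet synonym expansion.

```python
import math
from collections import Counter


def idf_weights(corpus: list[list[str]]) -> dict[str, float]:
    """idf(w) = log(|D| / df(w)) over a corpus of tokenized questions."""
    n_docs = len(corpus)
    df = Counter(word for doc in corpus for word in set(doc))
    return {word: math.log(n_docs / count) for word, count in df.items()}


def bow_vector(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Sparse bag-of-words vector: each present word carries its idf weight."""
    return {word: idf.get(word, 0.0) for word in set(tokens)}


def cosine_similarity(u: dict[str, float], v: dict[str, float]) -> float:
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(val * val for val in u.values()))
    norm_v = math.sqrt(sum(val * val for val in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def d_bow(q_i: list[str], q_j: list[str], idf: dict[str, float]) -> float:
    """Lexical distance: one minus the cosine similarity of the idf-weighted vectors."""
    return 1.0 - cosine_similarity(bow_vector(q_i, idf), bow_vector(q_j, idf))


if __name__ == "__main__":
    corpus = [["slope", "horizontal", "line"],
              ["slope", "vertical", "line"],
              ["limit", "rule"]]
    idf = idf_weights(corpus)
    print(d_bow(corpus[0], corpus[1], idf), d_bow(corpus[0], corpus[2], idf))
```

Questions that share no words end up at distance 1, while questions sharing rarer words are pulled closer than those sharing only common ones.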


3.2.2 Word Embedding Similarity 

Each question is represented as a weighted combination of the embeddings of the words in the question. The word vector $v_w$ for each word $w$ in the question is identified using the distributed representation of words generated by the word2vec tool [10]. Each question q is represented as:


$$\mathbf{q} = \sum_{w \in q} \alpha_w \, v_w \quad (4)$$

where $\alpha_w$ denotes the weight given to word $w$.


The similarity between two questions, $Sim_{vec}(q_i, q_j)$, is computed using the cosine similarity of the question vectors. The distance between a question pair $q_i, q_j$ is defined as:


$$D_{vec}(q_i, q_j) = 1 - Sim_{vec}(q_i, q_j) \quad (5)$$
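
A sketch of the embedding-based distance, under two stated assumptions: the vectors in `TOY_VECTORS` are made-up three-dimensional stand-ins for word2vec output, and words are combined with uniform weights unless a `weights` mapping is supplied.

```python
import math

# Hypothetical 3-dimensional vectors standing in for trained word2vec embeddings.
TOY_VECTORS = {
    "slope": [0.9, 0.1, 0.0],
    "gradient": [0.8, 0.2, 0.0],
    "line": [0.5, 0.5, 0.0],
    "limit": [0.0, 0.1, 0.9],
}


def question_vector(tokens, vectors, weights=None):
    """Weighted average of the word vectors of the tokens found in `vectors`."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for token in tokens:
        if token not in vectors:
            continue  # out-of-vocabulary words are skipped
        w = 1.0 if weights is None else weights.get(token, 1.0)
        total = [acc + w * comp for acc, comp in zip(total, vectors[token])]
        weight_sum += w
    if weight_sum == 0.0:
        return total
    return [comp / weight_sum for comp in total]


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def d_vec(q_i, q_j, vectors=TOY_VECTORS):
    """Embedding distance: one minus the cosine similarity of the question vectors."""
    return 1.0 - cosine_similarity(question_vector(q_i, vectors),
                                   question_vector(q_j, vectors))


if __name__ == "__main__":
    print(d_vec(["slope", "line"], ["gradient", "line"]))
    print(d_vec(["slope", "line"], ["limit"]))
```

Unlike the lexical distance, two questions with no words in common ("slope" versus "gradient") can still be close here, because their word vectors point in similar directions.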


3.3 Hierarchical Clustering

We use agglomerative hierarchical clustering. Initially, each question is in its own cluster. The nearest clusters are merged until there is only one cluster left. The end result is a cluster tree or dendrogram. The tree can be cut at any level to produce different clusters. Two common linkage criteria determine which clusters are nearest. The single linkage approach merges two clusters by considering the minimum distance between the points in the clusters to be merged. In the complete linkage approach, two clusters are merged by considering the maximum distance between the points in the clusters. Complete linkage clustering results in more compact clusters, as the merge criterion considers all points in the cluster. We use complete linkage clustering. The worst case run time complexity of agglomerative clustering is $O(n^2 \log n)$, which makes it too slow for large datasets. The primary advantage of the clustering approach is that it does not require any prior input to generate the cluster tree.
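
The merge procedure can be sketched in a few lines of plain Python. This naive version recomputes cluster distances on every pass, so it is slower than the complexity quoted above and is meant only to show the complete linkage criterion; the function name and the merge-record format are illustrative assumptions.

```python
def complete_linkage(dist):
    """Naive agglomerative clustering with complete linkage.

    dist: symmetric n x n matrix (list of lists) of pairwise distances.
    Returns the merges in order, each as (members_a, members_b, distance).
    """
    clusters = {i: [i] for i in range(len(dist))}
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # Complete linkage: cluster distance is the largest pairwise distance.
                d = max(dist[p][q] for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters.pop(b)
    return merges


if __name__ == "__main__":
    points = [0, 1, 5, 6, 20]  # toy one-dimensional data
    matrix = [[abs(p - q) for q in points] for p in points]
    for merge in complete_linkage(matrix):
        print(merge)
```

The returned merge list is a flat encoding of the dendrogram: each entry records which two clusters were joined and at what distance.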


We also evaluated another clustering algorithm, density-based spatial clustering of applications with noise (DBSCAN) [3], which has a worst case run time complexity of $O(n^2)$. The inputs to DBSCAN are the minimum number of points to form a cluster and the distance threshold eps such that, for every point in the cluster, there exists another point in the same cluster whose distance is less than eps. Selecting the distance threshold as an input can be a challenge, and the resulting clusters can vary significantly with eps.
[Figure 2(a): dendrogram; x-axis: data points, y-axis: cluster distance.]

(b) Resulting clusters for input cluster size range = [3,4]:

Cluster | Rank | Points
C1 | d1 | {8,9,10,11}
C2 | d2 | {4,5,6,7}
C3 | d3 | {1,2,3}


Figure 2: (a) Dendrogram (b) Clusters identified for
input range of number of points.


3.4 Dendrogram 

The output of the hierarchical clustering is a dendrogram, as shown in Figure 2(a). A typical approach is to cut the dendrogram at a specific distance and identify the resultant clusters. However, a dendrogram can be cut at different distances based on domain or application specific information. In our scenario, an important input from the instructor is the minimum number of points or questions in a cluster for it to be considered a FAQ. An instructor may decide that she would like to address groups of at least 4 similar questions, or provide a range of cluster sizes as input. Figure 2(b) depicts such a scenario, with a requested range of [3, 4] questions in each cluster. We use the number of questions as the input and provide a list of question clusters sorted by the cluster distance. Clusters that are linked at lower distance values are of good quality; as the linkage distance increases, the quality of the resulting clusters degrades.
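
One possible reading of this size-based cut is sketched below. It assumes the dendrogram is available as a list of merges `(left_members, right_members, distance)` and keeps the largest subtrees whose size falls in the requested range; this selection rule, the function name, and the toy merge list (chosen to mirror the point sets of Figure 2(b), with made-up distances) are all assumptions, since the exact procedure is not spelled out in the text.

```python
def clusters_in_size_range(merges, min_size, max_size):
    """Pick dendrogram nodes whose size lies in [min_size, max_size].

    merges: list of (left_members, right_members, distance), in merge order.
    Only maximal qualifying nodes are kept (a node is dropped if it is
    contained in a larger qualifying node). Results are sorted by distance.
    """
    selected = []
    for left, right, distance in reversed(merges):
        members = set(left) | set(right)
        if not (min_size <= len(members) <= max_size):
            continue
        if any(members <= kept for kept, _ in selected):
            continue
        selected.append((members, distance))
    ranked = [(sorted(members), distance) for members, distance in selected]
    return sorted(ranked, key=lambda item: item[1])


# Hypothetical merge list; the distances are invented for illustration.
TOY_MERGES = [
    ([8], [9], 0.10), ([10], [11], 0.15), ([8, 9], [10, 11], 0.20),
    ([4], [5], 0.25), ([6], [7], 0.30), ([4, 5], [6, 7], 0.35),
    ([1], [2], 0.40), ([1, 2], [3], 0.50),
    ([1, 2, 3], [4, 5, 6, 7], 0.80),
    ([1, 2, 3, 4, 5, 6, 7], [8, 9, 10, 11], 1.00),
]

if __name__ == "__main__":
    for members, distance in clusters_in_size_range(TOY_MERGES, 3, 4):
        print(distance, members)
```

Because the cut is driven by cluster size rather than a fixed distance threshold, different branches of the tree can be cut at different heights, and the instructor sees the tightest clusters first.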


4. EXPERIMENTAL EVALUATION 


In this section, we evaluate our method for identifying FAQ. 
We use a labeled data set from a CQA archive and create 
reference clusters. 


4.1 Data 


To evaluate the suitability of our approach, we use the SemEval 2016 Task 3 dataset, which contains questions and answers from the Qatar Living forum [12]. The data relevant for our evaluation contains questions categorized as Original questions. For each original question, a set of 10 related questions is annotated as PerfectMatch, Relevant or Irrelevant. Using the labeled information, we build a set of reference clusters, or ground truth, which contain the original question and the related questions that are either PerfectMatch or Relevant. Table 2 contains the details of the data set. The test dataset consisted of 770 questions.




Table 2: SemEval 2016 Task 3 dataset used.

Questions | Training | Test
Original Questions | 200 | 70
Related Questions (Total) | 1,999 | 700
  PerfectMatch | 181 | 81
  Relevant | 606 | 152
  Irrelevant | 1,212 | 467
Total | 2,199 | 770


4.2 Evaluation Metrics 

The quality of clustering is measured using F-Measure, combining the precision and recall scores used in information retrieval [7]. Each generated cluster $C_{gen}$ is treated as the result of a query and each reference cluster $C_{ref}$ is considered as the desired set of documents or points:


$$\mathrm{precision}(C_{gen}, C_{ref}) = \frac{|C_{gen} \cap C_{ref}|}{|C_{gen}|} \quad (6)$$

$$\mathrm{recall}(C_{gen}, C_{ref}) = \frac{|C_{gen} \cap C_{ref}|}{|C_{ref}|} \quad (7)$$

$$\mathrm{F\text{-}Measure}(C_{gen}, C_{ref}) = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \quad (8)$$

The average precision, recall and F-Measure values are computed over the clusters containing an “original question”. For the purpose of evaluation, we use the test data set and identify the partition or the distance threshold at which the maximum average F-Measure is obtained.
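
The three scores can be computed directly from set overlaps. The sketch below treats clusters as Python sets of question identifiers; the function name and the zero-division conventions are illustrative assumptions.

```python
def cluster_scores(generated: set, reference: set) -> tuple[float, float, float]:
    """Return (precision, recall, F-Measure) of a generated cluster vs. a reference cluster."""
    overlap = len(generated & reference)
    precision = overlap / len(generated) if generated else 0.0
    recall = overlap / len(reference) if reference else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # 3 of the 4 generated questions belong to a 6-question reference cluster.
    print(cluster_scores({1, 2, 3, 4}, {2, 3, 4, 5, 6, 7}))
```

Averaging these per-cluster scores over the clusters that contain an original question gives the summary values used to pick the best cut.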


4.3 Results 


The results of our approach are presented in Figure 3. We evaluate the cluster measures by considering the question-question distance metric using various values of Ω and x. High F-Measure and recall are achieved when we use lexical similarity as the primary distance metric. Using word embedding as the primary similarity metric results in higher precision, which could be suitable in scenarios where the data is noisy or contains a large number of irrelevant questions. Figure 3(a) varies the weights associated with lexical and word embedding based similarity. When Ω = 0.5, a balance between high precision and high recall is achieved. Further, Figure 3(b) shows the metrics achieved by varying x. Here, the best results are achieved with x = 4, with an F-measure of 0.653, a precision of 0.874 and recall of 0.5609. The SemEval 2016 Task 3 participants reported unofficial precision, recall and F-Measure values. Here, for each original question, Relevant and PerfectMatch questions are categorized as true pairs and Irrelevant questions are categorized as false pairs. The precision values reported by the top 4 participants ranged from 0.636 to 0.763. The recall values were higher and ranged from 0.553 to 0.759. The F-Measure was between 0.64 and 0.71. The results of our method are comparable and encouraging, as we have used an unsupervised model.


5. INFERRING FAQ FROM STUDENT QA SYSTEM


In order to verify the relevance of the approach, we ran the clustering tool on a student question answering platform. The dataset for the analysis was extracted from Khan Academy, by permission, using a screen scraping protocol. We considered micro lectures of 8th grade mathematics and micro lectures covering differential calculus. On the learning platform, each micro lecture video has easy access to the page where questions for that lecture can be asked or viewed. Asking questions is voluntary. Each learner can view questions that have been previously asked by their peers. Once a question is asked, a discussion thread is initiated with peer students providing answers. The data set contains about 22,000 questions from 300 video lectures. As questions are asked in the context of a given micro lecture, we infer the FAQs for each lecture. This helps us reduce the running time of our clustering algorithm.


[Figure 3: plots of F-Measure, precision and recall. (a) x = 1, varying Ω from [0,1]. (b) Ω = 0.5, varying x from [-5,5].]

Figure 3: F-Measure, Precision and Recall values by varying Ω and x.


Table 3: Sample FAQ inferred using proposed method from Khan Academy question answering forum.

C# | Video Lecture | Student Questions
C1 | Graphing a line in slope intercept form | what do the b stand for in the equation y = mx + b?
   |  | what do the m stand of in y = mx + b?
   |  | why do we use m and b in the equation y = mx + b ?
   |  | what if the m and the b be zero in the equation y = mx + b ?
C2 | Introduction to limits | isn’t 0/0 indeterminate not undefined
   |  | Sal said that 0/0 is undefined. Shouldn’t it be not a number?
   |  | At 1:18, why is 0 divided by 0 undefined? My teacher taught us it’s 0...
   |  | is 0/0 undefined, or one? and Why?
   |  | I thought that 0/0 is called a indeterminant not undefined. Correct my logic please
   |  | WHY is anything divided by 0 considered as undefined??
C3 | Definition of function | I’m trying to understand but, I see what he is doing but what ever he is saying is in slow motion so I don’t understand. And what is a piecewise function
   |  | Do you have a video where they give you a graph of a piecewise function, but need to find the rule?
   |  | How to find inequalities for piecewise functions?
   |  | How do you graph piecewise functions?
   |  | what is a piecewise function?
C4 | Proof of sin x by x | im a class 9 student and dont have 100% knowledge on trigonometry (just went through his videos once) so i dnt get what i am missing here: should he prove that for 3rd and 2nd quadrant as well?!
   |  | Is this statement is not applicable to 2nd &3rd quadrants ?? Why?
   |  | exactly why does this only apply to 1st and 4th quadrant why not, 2nd and 3rd?
   |  | what about the 2nd and 3rd quadrants?
   |  | X would not be negative in the 4th quadrant.,x is only negative in 2nd and 3rd quadrant.
   |  | why is he working in the first and fourth quadrants only? because the absolute value remains the same in all quadrants
   |  | @14:22 Khan says that cos(x) is always the x value in the first and fourth quadrants. Doesn’t he mean that cos(x) and x have the same sign in the first and fourth quadrants?
   |  | Why do we consider x only in the first and the fourth quadrant? Does it change the result if we need to consider all the quadrants?
   |  | I feel like I understand everything except going into the fourth quadrant. From 8:32 to the end of the video, he is discussing the fourth quadrant.
   |  | Why go into the fourth quadrant, and why does he stay away from the second and third quadrant?


5.1 Discussion 

Table 3 presents a subset of the clusters or FAQs extracted. Four example clusters are presented. We were able to extract 4 to 10 questions in each of the sample clusters. We observed several clusters with irrelevant questions, which resulted from poor semantic matches when the question content contained numerous mathematical expressions and symbols and little text. Our results can improve with domain specific preprocessing. The current preprocessing step does not parse or process mathematical expressions. Identifying expressions and tagging them as special tokens for computing the question-question distance could provide better results. We noticed several abbreviations in the questions that were not handled by our preprocessing step. In addition, many students had questions related to content presented at specific time periods in the video lectures. Annotating terms representing video lecture time periods as a part of preprocessing could help ascertain intervals of time within the lectures where students are seeking more information. Such domain specific processing of content in questions could help improve the question-question distance metric and reduce noise in the generated clusters.


6. CONCLUSION


Our goal in this work was to identify FAQ from the question answering systems of online learning environments. We used agglomerative clustering, an unsupervised learning approach, to identify the FAQ, as it did not require any prior inputs to identify groups of questions. A distance metric was defined to harness similarity based on bag of words and word embeddings. Our empirical evaluation on a labeled dataset shows the effectiveness of our approach, with precision and F-Measure values comparable to those of existing methods that use supervised models. We extracted questions asked by students from Khan Academy, and FAQ were extracted for each topic. In the future, we would include the answers provided by students in identifying similar questions. The answers can be filtered based on the votes received, student popularity and other related answers in the posts. This would improve the quality of the extracted FAQ.

